For this project, we used multiple online databases, described below.
Liz, Kavyaa, and I webscraped for information publicly available in multiple databases to determine the size and abundance of tRNA fragments. We are using that raw data to learn more about them in terms of their associated proteins and aim to determine the function of these conserved fragments. Datamining has proven to be a helpful way for us to study tRNA fragments outside of the typical wet lab setting. Our ultimate goal is to use these data to inform our future experiments using TGIRT-seq. Using TGIRT-seq will also give us the greater processivity of TGIRT reverse transcriptases across the difficult structures and modifications of tRNAs and their fragments.
We primarily used three R scripts. In the first, we loaded in the data from the tRFdb, formatted it to contain the information we were looking for, and wrote it into a csv file so it could be accessed even when the website is down. The second is a loop that allowed us to do the actual webscraping from the tRFdb website to determine the nucleotide counts found in their deep sequencing data that mapped to their tRNA construct for each gene. Finally, we used the information we webscraped to create plots for the visualization of the fragment sizes and abundance.
Prepwork Script
#Here, we read in the data from the tRFdb and formatted it to add columns that contained the start and end of the map positions.
tRFdb_human_tRF3_list <- read_csv("data/trfdb_human_tRF3.csv")
View(tRFdb_human_tRF3_list)
tRFdb_human_tRF3_list_mod <- tRFdb_human_tRF3_list %>%
mutate(tRF_start = str_match(`tRF Map Positions`, "(?<=Start:)\\d+"),
tRF_end = str_match(`tRF Map Positions`, "(?<=End:)\\d+"))
#Then, we wrote it into a csv file and gave it a relative file path.
write_csv(tRFdb_human_tRF3_list_mod, "data/trfdb_human_tRF3_update.csv")
#Here, we use a reference URL to make a list of all the experimental data
url <- "http://genome.bioch.virginia.edu/trfdb/experiments_display.php?trf_id=3023a&organism=human"
#We read in the html file as a table, and format it as a dataframe, from which we extract the first element of the list.
experiments <- read_html(url) %>%
html_nodes("table") %>%
html_table(fill = TRUE) %>%
.[[1]]
#Then, we filter for the columns of the dataframe that we are specifically looking at and write it into a csv file, so it can be accessed even when the website is down.
experiments <- experiments %>%
select(Experiment,Source)
View(experiments)
write_csv(experiments, "data/tRFdb_experiment_list.csv")
#Finally, we cut down this whole dataset to look specifically at the data from the experiments we want, which are the HEK and PARCLIP experiments, and write the trimmed list into its own csv file.
HEK_experiments <- experiments %>%
filter(str_detect(Source, "HEK") | str_detect(Source, "PARCLIP"))
View(HEK_experiments)
write_csv(HEK_experiments, "data/HEK_experiment_list.csv")
Loop Script We use this to pull out the gene names and coordinates to build links to webscrape nucleotide abundance data for the tRNA gene we are looking at.
#Here, we are assigning variables to be used later in the script. We place this at the top of the script to make it easier to change the tRNA gene (with or without a specific anticodon) we are looking at.
tRNA <- "Phe"
formatted_tRF_filename <- "data/tRF-PheGAA-tRFdb-HEK-abundance.csv"
#We are assigning variables that we will be using later to build the links.
HEK_experiment_IDs <- HEK_experiments %>%
pull(Experiment)
HEK_experiment_description <- HEK_experiments %>%
pull(Source)
#Use this code to peek at the dataset to see how the tRNA names were formatted, so that we can filter for them later on.
tRFdb_human_tRF3_list[1,"tRNA Name"]
View(tRFdb_human_tRF3_list)
#Filter the dataset for the tRNA you are specifically looking at and assign its gene coordinates to a variable for later use.
tRF_gene_coordinates <- tRFdb_human_tRF3_list %>%
filter(str_detect(`tRNA Name`, tRNA)) %>%
pull(`tRNA Gene Co-ordinates`) %>%
unique()
#Filter again, but this time, create a variable that includes the tRNA gene name.
tRF_gene_names <- tRFdb_human_tRF3_list %>%
filter(str_detect(`tRNA Name`, tRNA)) %>%
pull(`tRNA Name`) %>%
unique()
#Here, we assign all the componants of the tRFdb links to smaller variables so that we can use them to build an entire link specific for the genes we are looking at.
link1 = "http://genome.bioch.virginia.edu/trfdb/"
link2 = "graph.php?exp="
link3 = HEK_experiment_IDs
link4 = "&org=human&part1="
link5 = tRF_gene_coordinates
link6 = "&part2="
link7 = tRF_gene_names
#We use nested for loops to build the links to the gene we want and to read the file and convert it into a table that is readable in R, which we then input into a large dataframe with all of the results of the loops.
all_HEK_TRF_nuc_freq <- tibble()
x <- tibble()
for (experiment in 1:length(HEK_experiment_IDs)) { # for each of the experiments we're interested
for (gene in 1:length(tRF_gene_names)) { # for each of the tRNA genes we're interested
gene_df = paste(link1,link2,link3[experiment],link4,link5[gene],link6,link7[gene],sep = "") %>% # build a url address
read_html() %>% # read that url into R
html_table() %>% # convert the R file into a table
.[[1]] %>% # we specify that want just the first element of the list created by html_table
as_tibble() %>%
mutate("tRNA Gene"=link7[gene],
"Nucleotide Position"=COUNT,
"Experiment" = HEK_experiment_description[experiment],
"Experiment#" = HEK_experiment_IDs[experiment]) # here I'm formatting the dataframe with extra columns
all_HEK_TRF_nuc_freq = rbind(all_HEK_TRF_nuc_freq, gene_df) # this binds all the dataframes created by the for loops together in one big dataframe
}}
#Write this new large dataframe into a csv file
#write_csv(all_HEK_TRF_nuc_freq, formatted_tRF_filename)
formatted_tRF_filename <- "data/tRF-His-tRFdb-HEK-abundance.csv"
all_HEK_tRF_freq <- read_csv(formatted_tRF_filename)
all_HEK_tRF_freq
## # A tibble: 6,300 x 6
## COUNT RPM `tRNA Gene` `Nucleotide Position` Experiment `Experiment#`
## <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 110. chr9.trna7-HisGTG 1 HEK293 GSM416733
## 2 2 161. chr9.trna7-HisGTG 2 HEK293 GSM416733
## 3 3 171. chr9.trna7-HisGTG 3 HEK293 GSM416733
## 4 4 183. chr9.trna7-HisGTG 4 HEK293 GSM416733
## 5 5 185. chr9.trna7-HisGTG 5 HEK293 GSM416733
## 6 6 187. chr9.trna7-HisGTG 6 HEK293 GSM416733
## 7 7 187. chr9.trna7-HisGTG 7 HEK293 GSM416733
## 8 8 217. chr9.trna7-HisGTG 8 HEK293 GSM416733
## 9 9 225. chr9.trna7-HisGTG 9 HEK293 GSM416733
## 10 10 231. chr9.trna7-HisGTG 10 HEK293 GSM416733
## # … with 6,290 more rows
trfdb_human_tRF3_update2 <- read_csv("data/trfdb_human_tRF3_update2.csv")
all_HEK_tRF_freq_update2 <- left_join(all_HEK_tRF_freq,trfdb_human_tRF3_update2, by = c("tRNA Gene"="tRNA Name")) %>%
select(`Nucleotide Position`, Experiment, RPM, gtRNA_ID_abbrev)
all_HEK_tRF_freq_update2
## # A tibble: 12,600 x 4
## `Nucleotide Position` Experiment RPM gtRNA_ID_abbrev
## <dbl> <chr> <dbl> <chr>
## 1 1 HEK293 110. TRH-GTG-1-6
## 2 1 HEK293 110. TRH-GTG-1-6
## 3 2 HEK293 161. TRH-GTG-1-6
## 4 2 HEK293 161. TRH-GTG-1-6
## 5 3 HEK293 171. TRH-GTG-1-6
## 6 3 HEK293 171. TRH-GTG-1-6
## 7 4 HEK293 183. TRH-GTG-1-6
## 8 4 HEK293 183. TRH-GTG-1-6
## 9 5 HEK293 185. TRH-GTG-1-6
## 10 5 HEK293 185. TRH-GTG-1-6
## # … with 12,590 more rows
Graph Script - Overlay This script gives us a good idea of the fragments generated from each experiment, including their sizes and abundance, relative to one another.
#Here, we provide the setup information, where we can quickly and easily replace the tRNA name with the one we are looking for.
formatted_tRF_filename <- "data/tRF-His-tRFdb-HEK-abundance.csv"
graph_name_html <- file.path(getwd(), "graphs-plotly", "tRF-His-tRFdb-HEK-abundance-overlay.html")
all_HEK_tRF_freq <- read_csv(formatted_tRF_filename)
#The, we left join our two datasets to add the correct gtRNA ID's to the plot
all_HEK_tRF_freq_update2<- left_join(all_HEK_tRF_freq,trfdb_human_tRF3_update2, by = c("tRNA Gene"="tRNA Name")) %>%
select(`Nucleotide Position`, Experiment, RPM, gtRNA_ID_abbrev)
graph_title <- "tRNA-His fragment coverage"
#Establish colors you want to use in the graph
color_blind_friendly_cols<-c("#d7191c","#fdae61","#abd9e9","#2c7bb6","#000000")
# Use the variables we had previously created to graph the nucleotide counts (in reads per million reads), colored by each experiment the data came from.
graph <- all_HEK_tRF_freq_update2 %>%
ggplot(aes(x=`Nucleotide Position`, y=RPM, color= Experiment, label = `gtRNA_ID_abbrev`))+
geom_path()+labs(title = graph_title)
graph
#Finally, load into ggplot to make into an interactive graph (so you can mouse over a fragment and look at the exact counts), and view it!
p <- ggplotly(graph)
p
Graph Script - Faceted This script just includes a different way to view the nucleotide counts so that the graph does not look so cluttered. Instead, the graph is recreated for each gene and oriented next to the others.
#Graph the tRF frequencies!
p = all_HEK_tRF_freq %>%
ggplot(aes(x=`Nucleotide Position`, y=RPM, color= Experiment))+
geom_line()+
scale_color_manual(values = color_blind_friendly_cols)+
facet_grid(`tRNA Gene` ~ .)+
labs(title = "tRNA-His fragment coverage")
#This line of code graphs only experiments that are *not* from the HEK293 experiments.
p2 = all_HEK_tRF_freq %>%
filter(Experiment != "HEK293") %>%
ggplot(aes(x=`Nucleotide Position`, y=RPM, color= Experiment))+
geom_line()+
scale_color_manual(values = color_blind_friendly_cols)+
facet_grid(`tRNA Gene` ~ .)+
labs(title = "tRNA-His fragment coverage")
#Here, we can further format the graph to hide the panel borders and remove the grid lines.
final_plot = p + theme(
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black")
)
#View the graph!
final_plot
In the Histidine plot, we can see very well-defined and uniform fragment sizes, many around 22 nucleotides long, for all the datasets analyzed. Furthermore, we can see a very low association with AGO2 proteins and high association with AGO proteins 3 and 4.
Lysine CTT
Here, we see a much greater variation in fragment size. Additionally, by comparing the scale of this graph to the previous one, we can see that many more fragments were obtained by the AGO-pull down assays with LysCTT than His. We see that the HEK experiments obtained more fragments, which is expected since that dataset contains the entire small RNA population. Different fragment nucleotide locations are associated in greater abundance with some of the AGO proteins, with AGO3 interacting with 5’tRNA fragments and internal tRNA fragments. Interestingly, the 3’tRNA fragments around position 59-76 have some similarities with AGO-association as tRF-His, with AGO4 proteins having the most associated reads, and few reads associated with AGO2. A difference in this case, though, is that the next highest abundance reads are with AGO1, followed by AGO3 (use the plotly graph, zoom and pan interactive functions to zoom into this region).
## # A tibble: 5,715 x 6
## COUNT RPM `tRNA Gene` `Nucleotide Position` Experiment `Experiment#`
## <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 1874. chr15.trna2-LysCTT 1 HEK293 GSM416733
## 2 2 1886. chr15.trna2-LysCTT 2 HEK293 GSM416733
## 3 3 2008. chr15.trna2-LysCTT 3 HEK293 GSM416733
## 4 4 2082. chr15.trna2-LysCTT 4 HEK293 GSM416733
## 5 5 2108. chr15.trna2-LysCTT 5 HEK293 GSM416733
## 6 6 2152. chr15.trna2-LysCTT 6 HEK293 GSM416733
## 7 7 2170. chr15.trna2-LysCTT 7 HEK293 GSM416733
## 8 8 2191. chr15.trna2-LysCTT 8 HEK293 GSM416733
## 9 9 2204. chr15.trna2-LysCTT 9 HEK293 GSM416733
## 10 10 2219. chr15.trna2-LysCTT 10 HEK293 GSM416733
## # … with 5,705 more rows
## # A tibble: 11,430 x 4
## `Nucleotide Position` Experiment RPM gtRNA_ID_abbrev
## <dbl> <chr> <dbl> <chr>
## 1 1 HEK293 1874. TRK-CTT-1-2
## 2 1 HEK293 1874. TRK-CTT-1-2
## 3 2 HEK293 1886. TRK-CTT-1-2
## 4 2 HEK293 1886. TRK-CTT-1-2
## 5 3 HEK293 2008. TRK-CTT-1-2
## 6 3 HEK293 2008. TRK-CTT-1-2
## 7 4 HEK293 2082. TRK-CTT-1-2
## 8 4 HEK293 2082. TRK-CTT-1-2
## 9 5 HEK293 2108. TRK-CTT-1-2
## 10 5 HEK293 2108. TRK-CTT-1-2
## # … with 11,420 more rows
Tyrosine
With Tyr, we see many 22 and 25 nucleotide tRNA fragments that associate with AGO4 and AGO3 proteins in high abundance, higher than those acquired from the HEK experiments. Additionally, we see a higher association with AGO2 with Tyr than the other two tRNA fragment types shown previously.
## # A tibble: 6,530 x 6
## COUNT RPM `tRNA Gene` `Nucleotide Position` Experiment `Experiment#`
## <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 419. chr8.trna5-TyrGTA 1 HEK293 GSM416733
## 2 2 421. chr8.trna5-TyrGTA 2 HEK293 GSM416733
## 3 3 422. chr8.trna5-TyrGTA 3 HEK293 GSM416733
## 4 4 422. chr8.trna5-TyrGTA 4 HEK293 GSM416733
## 5 5 423. chr8.trna5-TyrGTA 5 HEK293 GSM416733
## 6 6 424. chr8.trna5-TyrGTA 6 HEK293 GSM416733
## 7 7 424. chr8.trna5-TyrGTA 7 HEK293 GSM416733
## 8 8 426. chr8.trna5-TyrGTA 8 HEK293 GSM416733
## 9 9 426. chr8.trna5-TyrGTA 9 HEK293 GSM416733
## 10 10 426. chr8.trna5-TyrGTA 10 HEK293 GSM416733
## # … with 6,520 more rows
## # A tibble: 12,320 x 4
## `Nucleotide Position` Experiment RPM gtRNA_ID_abbrev
## <dbl> <chr> <dbl> <chr>
## 1 1 HEK293 419. TRY-GTA-5-2
## 2 1 HEK293 419. TRY-GTA-5-2
## 3 2 HEK293 421. TRY-GTA-5-2
## 4 2 HEK293 421. TRY-GTA-5-2
## 5 3 HEK293 422. TRY-GTA-5-2
## 6 3 HEK293 422. TRY-GTA-5-2
## 7 4 HEK293 422. TRY-GTA-5-2
## 8 4 HEK293 422. TRY-GTA-5-2
## 9 5 HEK293 423. TRY-GTA-5-2
## 10 5 HEK293 423. TRY-GTA-5-2
## # … with 12,310 more rows
Our next step is to analyze the tRFdb data using our own alignment pipeline. From here, we will be able to be more forgiving with our alignment to include mutations added by the reverse transcriptase. We can compare our results to those of tRFdb to determine whether a significant amount of tRNA fragments were thrown out due to the strict pipeline they used. This is important because reverse transcriptases are not always able to read through modification sites and assign the correct nucleotide, so the tRFdb site may be throwing out reads, since they will not match the alignment construct with 100% accuracy. From there, we can look at the MODOMICS website to determine whether our pipeline was better able to retain fragments that overlapped with a known m1A modification found at the 3’end of tRNAs.
Hafner, Markus, Markus Landthaler, Lukas Burger, Mohsen Khorshid, Jean Hausser, Philipp Berninger, Andrea Rothballer, et al. “Transcriptome-Wide Identification of RNA-Binding Protein and MicroRNA Target Sites by PAR-CLIP.” Cell 141, no. 1 (April 2, 2010): 129–41. https://doi.org/10.1016/j.cell.2010.03.009.
Kumar, Pankaj, Jordan Anaya, Suresh B. Mudunuri, and Anindya Dutta. “Meta-Analysis of TRNA Derived RNA Fragments Reveals That They Are Evolutionarily Conserved and Associate with AGO Proteins to Recognize Specific RNA Targets.” BMC Biology 12, no. 1 (October 1, 2014): 78. https://doi.org/10.1186/s12915-014-0078-0.
Kumar, Pankaj, Suresh B. Mudunuri, Jordan Anaya, and Anindya Dutta. “TRFdb: A Database for Transfer RNA Fragments.” Nucleic Acids Research 43, no. Database issue (January 28, 2015): D141–45. https://doi.org/10.1093/nar/gku1138.
Mayr, Christine, and David P. Bartel. “Widespread Shortening of 3′UTRs by Alternative Cleavage and Polyadenylation Activates Oncogenes in Cancer Cells.” Cell 138, no. 4 (August 21, 2009): 673–84. https://doi.org/10.1016/j.cell.2009.06.016.